1 Introduction

Single-cell transcriptome sequencing (scRNA-seq) experiments allow us to discover new cell types and help us understand how they arise in development. The Monocle 3 package provides a toolkit for analyzing single-cell gene expression experiments. In our scDown package, we provide a pseudotime trajectory analysis workflow with a single function that carries out the pre-processing, trajectory analysis, and downstream differential expression analysis. In addition to that, we also generates various high quality plots that are very crucial for the trajectory analysis.

2 Data required to run the monocle3 workflow with default parameters

We need two required information/variables to run the monocle3 function. Users need to load a Seurat object (seurat_obj) containing single-cell RNA sequencing data. The species is set to either “human” or “mouse,” as monocle3 supports only these two species. The metadata column (annotation_column) is defined to specify the cell type or other annotation labels, ensuring the analysis runs with the correct input data (by default it takes annotations from Idents). The users can also specify the working directory for the output files and plots (By default it is current directory). There are also a lot of other optional variables that the users can pass which we are going to talk about in the different workflows below.

# Set the working directory
output_dir <- "."

# Seurat object
seurat_obj <- readRDS("../scDown/inst/extdata/10X43_1_spliced_unspliced.rds")

# Define the species: either "mouse" or "human", since potency method only support these two species for identifying the root node
species <- "mouse"

# PCA Dimensions to preprocess the data
nDim <- 30 

# Metadata cell type column
annotation_column <- "clusters"

3 Default monocle3 workflow with the scRNAseq data

Users can load their scRNASeq data which they would like to run the trajectory analysis. To run the trajectory analysis, the users have to provide the scRNASeq data and species. By default, we would run the trajectory analysis on the whole dataset, identify the root nodes using potency method, and run trajectory analysis using monocle3.

library(scDown)
run_monocle3(seurat_obj = seurat_obj,
             species = species,
             nDim = nDim,
             annotation_column = annotation_column,
             graph_test=TRUE,
             output_dir = output_dir)

Running Time: ~ 5 minutes

Operating System: Rocky Linux 8.9 (Green Obsidian)

  • Product Name: RHEL
  • Product Version: 8.9

CPU: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz

Memory: 10 GB

3.1 Pre-processing and visualization of the single cell data

The first step of construction of the single cell trajectories is the pre-processing step of the data. In this step, the data is normalized and batch corrected (if the user provides the batch metadata in the batch_metadata variable. By Default, it is NULL). After that, the dimensionality reduction is carried out and UMAP coordinates are calculated. The function can also use the UMAP coordinates from the Seurat object (by setting transferUMAP=TRUE). The UMAP plot shows the different cell types in the data after transferring the UMAP coordinates.

Next, the cells are grouped into broad cell clusters represented in the data. Monocle does not assume that all cells in the dataset descend from a common transcriptional “ancestor”. In many experiments, there might in fact be multiple distinct trajectories. The monocle functions uses a technique called community detection to group cells. This approach was introduced by Levine et al as part of the phenoGraph algorithm. The UMAP plot shows the different partitions in the data.

3.2 Construction and visualization of single-cell trajectories

Next, the trajectories in the single-cell data is learned by fitting a principal graph within each partition. These trajectories are further used to fit or order the cells in the pseudotime. Pseudotime is a measure of how much progress an individual cell has made through a process such as cell differentiation. In order to do that, monocle needs a root node which acts as a starting point for the differentiation or the biological process. The UMAP plot shows the different nodes along the trajectories.

In our workflow, we have three methods to identify the root node in the single cell data. By default (rootNode_method = "potency"), we use the methods developed by Teschendorff et al. which uses the their gene expression profiles and a protein-protein interaction network to automatically identify the root node based on the potency. Below we show the UMAP plot that ordered the cells based on the root node selected using potency.

The second method is to automatically identify the root node which is most heavily occupied by early time point cells ((rootNode_method = "rootNodes")). For that, the users need to provide the early timepoint category (timepoint) and the meta data column (timePoint_metadata) in the seurat object. The third method is that users can provide the root name by selecting and passing the principal point (rootNode) on the trajectory in the trajectory UMAP plot.

3.3 Identifiation of significant trajectory-variable genes and modules

The pipeline also identifies the genes that vary between groups of cells across the trajectories using the spatial autocorrelation analysis called Moran’s I developed by Cao & Spielmann et al. This method identifies genes that vary between groups of cells in UMAP space and is effective in finding genes that vary in single-cell RNA-seq datasets. All the significant trajectory-variable genes were exported in a csv file (monocleDEG_significant_by_trajectory.csv).The plot below shows the expression of the top trajectory-variable genes in a feature plot.

We also visualized the expression levels of top trajectory-variable genes across the pseudotime and colored as cell types.

The set of the trajectory-variable genes is further grouped into modules using Louvain community analysis. The clustering information is exported in a csv file (monocleDEG_by_trajectory_moduleInformation.csv). The module heatmap below shows the expression levels of each module across the cell types.

3.4 Visualization of cell distribution across the pseudotime

The function will also generate plots that shows the distribution of the cells across the pseudotime. Below the density plot shows the distribution of cells across the pseudotime for the complete dataset separated by the timepoints.

Below the histogram shows the distribution of cells across the pseudotime for the complete dataset per cell type separated by timepoints.

The workflow also generates the distribution of cells per condition too in the output folder.

4 Comparison of trajectories between different groups

In this part, we will show how the pipeline can be used to analyze the trajectories between the different conditions and compare the expression of the significant trajectory variable genes across the conditions using regression analysis.

4.1 Construction and visulization of single-cell trajectories per group

The users need to provide the conditions and condition metadata in the run_monocle3 function to carry out the trajectory analysis across different conditions. This workflow also need the deg_method (By default, deg_method = "quasipoisson") to carry out the regression analysis to identify those trajectory variable genes that differ between the conditions. In our test data, we selected the age of the mouse in days (12 and 35 days) as the conditions.

conditions <- c("12","35") # a string specifying a column in seurat_object@meta.data that contains the condition names, or NULL when no condition is filled.
condition_metadata <- "age.days."
deg_method = "quasipoisson"

library(scDown)
run_monocle3(seurat_obj=seurat_obj,
             species=species,
             nDim = nDim,
             annotation_column = annotation_column,
             graph_test=TRUE,
             conditions = conditions,
             group_column=condition_metadata,
             deg_method = "quasipoisson",
             output_dir = output_dir)

Running Time: ~ 8 minutes

Operating System: Rocky Linux 8.9 (Green Obsidian)

  • Product Name: RHEL
  • Product Version: 8.9

CPU: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz

Memory: 10 GB

The UMAP plot shows the different cell types in the data at timepoint 12.

Below we show the UMAP plot that ordered the cells based on the root node selected using potency at timepoint 12.

4.2 Visualization of cell distribution across the pseudotime separated by conditions

When conditions are supplied, the function will also generate plots that shows the distribution of the cells across the pseudotime separated by conditions. Below the density plot shows the distribution of cells across the pseudotime for the complete dataset separated by the timepoints.

Below the histogram shows the distribution of cells across the pseudotime for the complete dataset per cell type separated by timepoints.

The workflow also generates the distribution of cells per condition too in the output folder.

4.3 Identifiation of significant differential trajectory-variable genes between conditions

When conditions are supplied, the pipeline also identifies the significant differential trajectory-variable genes between conditions by fitting a regression model (using the conditions metadata) to each identified trajectory-variable genes. All the significant differential trajectory-variable genes were exported in a csv file (monocleDEG_significant_by_age.days.+trajectory_12_35.csv).The plot below shows the expression of the top trajectory-variable genes in a feature plot.

We also visualized the expression levels of top trajectory-variable genes across the pseudotime and colored as cell types.

5 Regression analysis between various groups in the scRNASeq data

The pipeline can also be used to identify the significant differential expression genes between conditions using a generalized linear model using “quasipoisson” distribution by default. The users can use four different models (“quasipoisson”, “negbinomial”,“poisson” or “binomial”) by supplying it in deg_model variable. This is an optional feature which can be turned on by supplying the metadata in the metadata_deg_model.

conditions <- c("12","35") # a string specifying a column in seurat_object@meta.data that contains the condition names, or NULL when no condition is filled.
condition_metadata <- "age.days."
deg_method = "quasipoisson"

library(scDown)
run_monocle3(seurat_obj=seurat_obj,
             species=species,
             nDim = nDim,
             annotation_column = annotation_column,
             deg_method = "quasipoisson",
             metadata_deg_model=condition_metadata,
             output_dir = output_dir)

Running Time: ~ 8 minutes

All the significant differential expressed genes were exported in a csv file (monocleDEG_significant_by_age.days..csv).The plot below shows the expression of the top differential expressed genes in a feature plot.

We also visualized the expression levels of top differential expressed genes in violin plot.

6 Comparison of trajectories between different cell types groups

In this workflow, the users can also provide list of combinations of cell types between which they would like to analyze using the trajectory analysis workflow.

library(scDown)

cell_types_monocle<-list(c("Granule immature","Granule mature"))
library(scDown)
run_monocle3(seurat_obj=seurat_obj,
             species=species,
             nDim = nDim,
             annotation_column = annotation_column, 
             conditions = conditions,
             group_column=condition_metadata,
             deg_method = "quasipoisson",
             celltype_groups = cell_types_monocle,
             graph_test=TRUE,
             output_dir = output_dir)

Running Time: ~10 minutes

Operating System: Rocky Linux 8.9 (Green Obsidian)

  • Product Name: RHEL
  • Product Version: 8.9

CPU: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz

Memory: 10 GB

6.1 Construction and visualization of single-cell trajectories

As mentioned previously, the trajectory in the cell type groups is learned by fitting a principal graph within each partition which are further used to fit or order the cells in the pseudotime. The UMAP plot shows the different nodes along the trajectories for the cell type groups.

In our workflow, we used the potency method to find the root node (rootNode_method = "potency"). Below we show the UMAP plot that ordered the cells based on the root node selected using potency. As expected, the root node is found in the Granule immature cells.

6.2 Identifiation of significant trajectory-variable genes and modules

After that, the pipeline identifies the genes that vary between cell types across the trajectories using the spatial autocorrelation analysis. All the significant trajectory-variable genes were exported in a csv file (monocleDEG_significant_by_trajectory_Granule immature_Granule mature.csv).The plot below shows the expression of the top trajectory-variable genes in a feature plot.

We also visualized the expression levels of top trajectory-variable genes across the pseudotime and colored as cell types.

The set of the trajectory-variable genes is further grouped into modules using Louvain community analysis. The clustering information is exported in a csv file (monocleDEG_by_trajectory_moduleInformation_Granule immature_Granule mature.csv). The module heatmap below shows the expression levels of each module across the two cell types.

6.3 Visualization of cell distribution across the pseudotime

The function generate plots that shows the distribution of the cells across the pseudotime. Below the density plot shows the distribution of cells across the pseudotime for the cell types separated by the timepoints.

Below the histogram shows the distribution of cells across the pseudotime for the cell types separated by timepoints.

7 Output folders and files explained

All outputs of the pipeline will be in the monocle folder, which is divided into:

  • csv
  • images
  • figures
  • rds

The csv subfolder contains:

  • full DEG analysis results and from them the significant DEGs per specified model and trajectory.

The images subfolder contains :

  • cellDistribution: cell and cell type distribution density plots plus histogram per (subsetted) object and condition.

  • DEG:

    • significant DEGs found by regression analysis on predefined models (defined in universal variable @DEG_models), visualized in feature plot and violin plot. This is a general linear regression method, and might work better for models that have continuous values.
    • significant DEGs found by graph auto-correlation analysis along the trajectory/pseudotime, visualized in feature plot, and plotted their expression level changes along pseudotime.
      • These trajectory-variable genes are subsetted to run DEG analysis with regression again, in an effort to identify genes that are differentially expressed along pseudotime AND between two conditions.
  • pseudotime: umap by cell types, partitions, learned trajectory, and pseudotime.

The rds subfolder contains:

  • rds: Rds files with the monocle object with the trajectory results

8 Summary

This document outlines a pipeline for analyzing trajectory analysis using monocle3. The pipeline begins by defining key variables, such as the working directory, Seurat object, species, and metadata for cell type annotations. It runs monocle3 on the entire dataset to explore global trajectory analysis and generates visualizations like UMAPs, density, histogram, heatmaps, and violin plots, which highlight trajectories, expression of genes, distribution of cells across pseudotime, and expression of differential expressed genes.

Additionally, the pipeline supports focused analysis by selecting specific cell types or comparing user-defined groups (e.g., age groups 35 and 12). Pairwise comparisons are made to examine differences in trahetcory differences. This pipeline provides a comprehensive tool for exploring and comparing trajectory analysis.

The analysis was performed on a machine with the following specifications:

  • Operating System: Rocky Linux 8.9 (Green Obsidian)
    • Product Name: RHEL
    • Product Version: 8.9
  • CPU: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
  • Memory: 10 GB